library(nycflights13)
library(tidyverse)
library(dplyr)
Find all flights that:
- Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
- Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
or
filter(flights, dest %in% c("IAH", "HOU"))
- Were operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))
- Departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))
- Arrived more than two hours late, but didn’t leave late
filter(flights, dep_delay <= 0 & arr_delay > 120)
- Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
- Departed between midnight and 6am (inclusive)
summary(flights$dep_time)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1 907 1401 1349 1744 2400 8255
filter(flights, dep_time %% 2400 <= 600)
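The modulo step is there because midnight departures are recorded as 2400 (the Max. in the summary above) rather than 0. A minimal base-R sketch of the idea, using a made-up vector of clock times instead of the real data:

```r
# Hypothetical HHMM clock times; 2400 stands for midnight
dep_time <- c(1, 559, 600, 601, 2359, 2400)

# %% 2400 maps 2400 to 0 and leaves every other value unchanged
wrapped <- dep_time %% 2400

# Keep departures from midnight through 6am, inclusive
early <- dep_time[wrapped <= 600]
early
```

Without the modulo, the 2400 entry would be dropped by a plain dep_time <= 600 test.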
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between(x, left, right) is a shortcut for x >= left & x <= right, so both boundaries are inclusive.
filter(flights, between(month, 7, 9))
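The equivalence can be checked directly on a small vector. This is a sketch on made-up month values, not on the flights data:

```r
library(dplyr)

# Hypothetical month values around the summer range
month <- c(6, 7, 8, 9, 10)

# between() is inclusive on both ends
with_between <- between(month, 7, 9)

# The longhand comparison it replaces
with_compare <- month >= 7 & month <= 9

identical(with_between, with_compare)
```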
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
count(flights, is.na(dep_time))
summary(flights)
year month day dep_time sched_dep_time
Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
NA's :8255
dep_delay arr_time sched_arr_time arr_delay
Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
Median : -2.00 Median :1535 Median :1556 Median : -5.000
Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
NA's :8255 NA's :8713 NA's :9430
carrier flight tailnum origin
Length:336776 Min. : 1 Length:336776 Length:336776
Class :character 1st Qu.: 553 Class :character Class :character
Mode :character Median :1496 Mode :character Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
dest air_time distance hour minute
Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00 Min. : 0.00
Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00 1st Qu.: 8.00
Mode :character Median :129.0 Median : 872 Median :13.00 Median :29.00
Mean :150.7 Mean :1040 Mean :13.18 Mean :26.23
3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:44.00
Max. :695.0 Max. :4983 Max. :23.00 Max. :59.00
NA's :9430
time_hour
Min. :2013-01-01 05:00:00
1st Qu.:2013-04-04 13:00:00
Median :2013-07-03 10:00:00
Mean :2013-07-03 05:22:54
3rd Qu.:2013-10-01 07:00:00
Max. :2013-12-31 23:00:00
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_time)))
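This works because is.na() returns TRUE or FALSE, and desc() puts TRUE first. A small sketch on a made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data with two missing departure times
toy <- tibble(id = 1:4, dep_time = c(517, NA, 533, NA))

# TRUE (missing) sorts ahead of FALSE when wrapped in desc()
sorted <- arrange(toy, desc(is.na(dep_time)))
sorted
```

A plain arrange(toy, dep_time) would instead place the NA rows last.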
Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
arrange(flights, dep_delay)
Sort flights to find the fastest (highest speed) flights.
arrange(flights, desc(distance / air_time))
Which flights travelled the farthest? Which travelled the shortest?
arrange(flights, desc(distance))
arrange(flights, distance)
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with('dep'), starts_with('arr'))
select(flights, 4, 6, 7, 9)
What happens if you include the name of a variable multiple times in a select() call?
select(flights, dep_time, dep_time)
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
flights %>% select(any_of(vars))
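any_of() selects whichever of the named columns exist and ignores the rest, which helps when a character vector may list columns that are not in the data. A sketch contrasting it with the stricter all_of(), on a made-up tibble that lacks two of the columns:

```r
library(dplyr)

# Hypothetical data: no "day" or "arr_delay" column
toy <- tibble(year = 2013, month = 1, dep_delay = 2)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names that exist and silently skips the others
kept <- names(select(toy, any_of(vars)))
kept

# all_of() is strict: the same vector raises an error here
strict <- tryCatch(
  select(toy, all_of(vars)),
  error = function(e) "error"
)
strict
```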
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
Default is to ignore case.
select(flights, contains("TIME"))
select(flights, contains("TIME", ignore.case = FALSE))
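The effect of ignore.case is easier to see when column names differ in case. A sketch on a made-up tibble (the upper-case AIR_TIME column is an assumption for illustration; flights itself uses lower-case names):

```r
library(dplyr)

# Hypothetical columns with mixed case
toy <- tibble(dep_time = 517, AIR_TIME = 227, carrier = "UA")

# Default: case is ignored, so both time columns match
loose <- names(select(toy, contains("TIME")))

# Case-sensitive: only the upper-case column matches
exact <- names(select(toy, contains("TIME", ignore.case = FALSE)))

loose
exact
```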
mutate()
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
transmute(flights,
  dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
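Because the same arithmetic is applied to each column, it can be factored into a helper. The name hhmm_to_minutes is made up for this sketch, and the input values are hypothetical:

```r
# Convert HHMM clock times to minutes since midnight.
# %/% 100 extracts the hours, %% 100 the minutes;
# the final %% 1440 maps midnight (2400) to 0.
hhmm_to_minutes <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

minutes <- hhmm_to_minutes(c(1, 517, 1230, 2359, 2400))
minutes
```

There are 1440 minutes in a day, which is why the wrap-around constant is 1440.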
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
select(mutate(flights, arr_dep_time = arr_time - dep_time), air_time, arr_dep_time)
mutate(flights,
arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440) %>%
transmute(air_time, arr_dep_time = arr_time - dep_time)
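This still doesn’t reproduce air_time. dep_time and arr_time are local clock times, so time zone differences between origin and destination distort the difference, as do flights that land after midnight. The interval likely also includes taxi time, which air_time excludes.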
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
select(flights, dep_time, sched_dep_time, dep_delay)
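Expected relationship, once both times are converted to minutes since midnight: dep_time - sched_dep_time == dep_delay. On the raw HHMM values this breaks whenever the delay crosses an hour boundary, and flights delayed past midnight need a further correction.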
Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
min_rank() gives tied values the same rank (the smallest one they would occupy) and leaves a gap in the ranks that follow.
min_rank(c(10, 5, 1, 5, 5))
[1] 5 2 1 2 2
mutate(flights, dep_delay_min_rank = min_rank(desc(dep_delay))) %>%
  filter(dep_delay_min_rank <= 10) %>%
  arrange(dep_delay_min_rank)
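How ties are handled depends on the ranking function chosen. A short comparison of three dplyr ranking functions on the same vector used above:

```r
library(dplyr)

x <- c(10, 5, 1, 5, 5)

# Ties share the smallest rank; the next rank skips ahead
r_min <- min_rank(x)

# Ties share a rank and no ranks are skipped
r_dense <- dense_rank(x)

# Ties are broken by position, so every rank is unique
r_row <- row_number(x)

rbind(r_min, r_dense, r_row)
```

With min_rank(), a cutoff of rank <= 10 can return more than 10 rows when flights tie at the boundary; row_number() always returns exactly 10.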
What does 1:3 + 1:10 return? Why?
1:3 + 1:10
Warning in 1:3 + 1:10 :
longer object length is not a multiple of shorter object length
[1] 2 4 6 5 7 9 8 10 12 11
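R recycles the shorter vector to the length of the longer one. When the longer length is not a multiple of the shorter, the result is still returned, but with the warning above. When it is a multiple, recycling is silent: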
1:2 + 1:10
[1] 2 4 4 6 6 8 8 10 10 12
What trigonometric functions does R provide?
These can be viewed in ?Trig documentation.
- cos(x), sin(x), tan(x)
- acos(x), asin(x), atan(x), atan2(y, x)
- cospi(x), sinpi(x), tanpi(x)
summarise()
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
1. A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
2. A flight is always 10 minutes late.
3. A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
4. 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
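One way to explore the question is to encode each scenario as a vector of delays and compare several summaries side by side. This is a base-R sketch; the scenario names and the choice of 100 flights per scenario are assumptions made for illustration:

```r
# Each scenario as 100 delays in minutes (negative = early)
scenarios <- list(
  early_or_late_15 = rep(c(-15, 15), each = 50),
  always_10_late   = rep(10, 100),
  early_or_late_30 = rep(c(-30, 30), each = 50),
  rare_2h_delay    = c(rep(0, 99), 120)
)

# Five candidate summaries per scenario
summaries <- sapply(scenarios, function(d) c(
  mean      = mean(d),
  median    = median(d),
  sd        = sd(d),
  prop_late = mean(d > 0),
  p99       = unname(quantile(d, 0.99))
))

round(summaries, 2)
```

The two early-or-late scenarios share a mean and median of 0 and differ only in spread, so a measure of variability is needed to tell them apart.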
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
df %>% count(x, wt = w) is equivalent to df %>% group_by(x) %>% summarise(n = sum(w))
Adds the total tally of distance for each tailnum in this example
not_cancelled <- flights %>%
filter(!is.na(arr_delay), !is.na(dep_delay)) #filter out missing number
not_cancelled %>% count(dest)
not_cancelled %>% count(tailnum, wt = distance)
not_cancelled <- flights %>%
filter(!is.na(arr_delay), !is.na(dep_delay)) #filter out missing numbers
not_cancelled %>%
  group_by(dest) %>% # group by destination
  summarise(n = n()) # count rows per destination, named n as count() does
not_cancelled %>%
  group_by(tailnum) %>% # group by tail number
  summarise(n = sum(distance)) # total distance per plane
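The equivalence can be checked on a small example. This sketch uses a made-up tibble (toy and its values are hypothetical) rather than not_cancelled:

```r
library(dplyr)

# Hypothetical data: two planes, three flights
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 200, 50)
)

# Weighted count: sums distance within each tailnum
via_count <- toy %>% count(tailnum, wt = distance)

# The same result without count()
via_groupby <- toy %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

identical(via_count$n, via_groupby$n)
```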
Our definition of cancelled flights is slightly suboptimal. Why? Which is the most important column?
filter(flights, !is.na(dep_delay), is.na(arr_delay)) %>%
select(dep_time, arr_time, sched_arr_time, dep_delay, arr_delay)
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
(flights %>%
mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
group_by(year, month, day) %>%
summarise(
cancelled_num = sum(cancelled),
flights_num = n(),
cancelled_prop = mean(cancelled)
)
)
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
Florida airlines
What does the sort argument to count() do? When might you use it?
With sort = TRUE, count() returns the groups in descending order of n. It is shorthand for count() %>% arrange(desc(n)), handy when you want the largest groups first.
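A quick check of that shorthand on a made-up tibble (the destinations and their frequencies are hypothetical):

```r
library(dplyr)

# Hypothetical destinations
toy <- tibble(dest = c("IAH", "ORD", "ORD", "ATL", "ORD", "ATL"))

# Largest group first, in one call
sorted_count <- toy %>% count(dest, sort = TRUE)

# The two-step version it abbreviates
manual_sort <- toy %>% count(dest) %>% arrange(desc(n))

sorted_count
```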
Which plane (tailnum) has the worst on-time record?